Practical Assignment: Visualizing Airbnb Supply in the City of Buenos Aires

Students

  • José Ignacio Benavente
  • Cristian Gastón Risuleo

Objective

The objective of this practical assignment is to apply exploratory analysis and data visualization techniques to a real dataset, following the good practices covered in the course. Students are expected to develop the skills to extract, interpret, and communicate relevant information through effective charts.

Assignment

Using the Airbnb Buenos Aires dataset listings.csv, you must:

  1. Explore and understand the dataset.
    • Analyze the available attributes: data types, null values, categorical and numerical variables, etc.
  2. Perform an exploratory data analysis (EDA).
    • Detect general patterns, distributions, relationships between variables, and outliers.
  3. Formulate interesting questions based on the dataset.
  4. Visualize the data.
    • Create static or interactive visualizations that answer the questions posed.
    • Apply good chart design practices: appropriate choice of chart types, scales, colors, titles, legends, etc.
  5. Oral presentation in class.
    • Prepare a 10-15 minute presentation covering the work done, the questions analyzed, and the visualizations created.
    • Justify your analysis and visualization decisions.

In [112]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import warnings
import folium
from folium.plugins import HeatMap
from IPython.display import display, Markdown
import matplotlib.cm as cm
import matplotlib.colors as colors
import altair as alt

plt.rcParams['figure.figsize'] = (12, 8)

1. Explore and understand the dataset

Analyze the available attributes: data types, null values, categorical and numerical variables, etc.

In [113]:
listings_df = pd.read_csv('listings.csv')
reference_df = pd.read_csv('reference.csv')

print("Dataset 'listings.csv':\n")
print(f"Dataset shape: {listings_df.shape}\n")
print(f"Columns: {list(listings_df.columns)}\n")
print("First 5 rows:")
listings_df.head()
Dataset 'listings.csv':

Dataset shape: (35172, 18)

Columns: ['id', 'name', 'host_id', 'host_name', 'neighbourhood_group', 'neighbourhood', 'latitude', 'longitude', 'room_type', 'price', 'minimum_nights', 'number_of_reviews', 'last_review', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365', 'number_of_reviews_ltm', 'license']

First 5 rows:
Out[113]:
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 number_of_reviews_ltm license
0 11508 Amazing Luxurious Apt-Palermo Soho 42762 Candela NaN Palermo -34.581840 -58.424150 Entire home/apt 67518.0 3 44 2025-01-26 0.29 1 300 5 NaN
1 14222 RELAX IN HAPPY HOUSE - PALERMO, BUENOS AIRES 87710233 María NaN Palermo -34.586170 -58.410360 Entire home/apt 22375.0 7 123 2025-01-18 0.80 6 44 8 NaN
2 15074 ROOM WITH RIVER SIGHT 59338 Monica NaN Nuñez -34.538920 -58.465990 Private room NaN 29 0 NaN NaN 1 0 0 NaN
3 16695 DUPLEX LOFT 2 - SAN TELMO 64880 Elbio Mariano NaN Monserrat -34.614390 -58.376110 Entire home/apt 52511.0 2 45 2019-11-30 0.27 9 365 0 NaN
4 20062 PENTHOUSE /Terrace & pool /City views /2bedrooms 75891 Sergio NaN Palermo -34.581848 -58.441605 Entire home/apt 113360.0 2 330 2025-01-17 1.84 4 209 25 NaN
In [114]:
print("Dataset 'reference.csv':\n")
print(f"Dataset shape: {reference_df.shape}\n")
print(f"Columns: {list(reference_df.columns)}\n")
print("Rows:")
reference_df
Dataset 'reference.csv':

Dataset shape: (17, 4)

Columns: ['Field', 'Type', 'Calculated', 'Description']

Rows:
Out[114]:
Field Type Calculated Description
0 id integer NaN Airbnb's unique identifier for the listing
1 name string NaN NaN
2 host_id integer NaN NaN
3 host_name string NaN NaN
4 neighbourhood_group text y The neighbourhood group as geocoded using the ...
5 neighbourhood text y The neighbourhood as geocoded using the latitu...
6 latitude numeric NaN Uses the World Geodetic System (WGS84) project...
7 longitude NaN NaN Uses the World Geodetic System (WGS84) project...
8 room_type string NaN NaN
9 price currency NaN daily price in local currency. Note, $ sign ma...
10 minimum_nights integer NaN minimum number of night stay for the listing (...
11 number_of_reviews integer NaN The number of reviews the listing has
12 last_review date y The date of the last/newest review
13 calculated_host_listings_count integer y The number of listings the host has in the cur...
14 availability_365 integer y avaliability_x. The availability of the listin...
15 number_of_reviews_ltm integer y The number of reviews the listing has (in the ...
16 license string NaN NaN
In [115]:
print("Dataset dimensions")
print(f"{listings_df.shape[0]:,} rows and {listings_df.shape[1]} columns")

print("\nData type information\n")
listings_df.info()

print("\nDescriptive statistics\n")
display(listings_df.describe())
Dataset dimensions
35,172 rows and 18 columns

Data type information

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35172 entries, 0 to 35171
Data columns (total 18 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              35172 non-null  int64  
 1   name                            35172 non-null  object 
 2   host_id                         35172 non-null  int64  
 3   host_name                       35166 non-null  object 
 4   neighbourhood_group             0 non-null      float64
 5   neighbourhood                   35172 non-null  object 
 6   latitude                        35172 non-null  float64
 7   longitude                       35172 non-null  float64
 8   room_type                       35172 non-null  object 
 9   price                           31598 non-null  float64
 10  minimum_nights                  35172 non-null  int64  
 11  number_of_reviews               35172 non-null  int64  
 12  last_review                     29412 non-null  object 
 13  reviews_per_month               29412 non-null  float64
 14  calculated_host_listings_count  35172 non-null  int64  
 15  availability_365                35172 non-null  int64  
 16  number_of_reviews_ltm           35172 non-null  int64  
 17  license                         390 non-null    object 
dtypes: float64(5), int64(7), object(6)
memory usage: 4.8+ MB

Descriptive statistics

id host_id neighbourhood_group latitude longitude price minimum_nights number_of_reviews reviews_per_month calculated_host_listings_count availability_365 number_of_reviews_ltm
count 3.517200e+04 3.517200e+04 0.0 35172.000000 35172.000000 3.159800e+04 35172.000000 35172.000000 29412.000000 35172.000000 35172.000000 35172.000000
mean 7.117880e+17 2.143294e+08 NaN -34.591554 -58.417289 9.548776e+04 6.159871 28.027579 1.325920 15.646878 206.609320 9.500625
std 4.840974e+17 2.027420e+08 NaN 0.018257 0.030169 1.402656e+06 26.072002 45.151666 1.336446 34.108688 126.189102 13.991379
min 1.150800e+04 1.342600e+04 NaN -34.693700 -58.530890 2.600000e+02 1.000000 0.000000 0.010000 1.000000 0.000000 0.000000
25% 4.719776e+07 3.070016e+07 NaN -34.602670 -58.437050 2.971100e+04 1.000000 2.000000 0.360000 1.000000 88.000000 0.000000
50% 8.875662e+17 1.421004e+08 NaN -34.590890 -58.418914 3.990800e+04 2.000000 11.000000 0.930000 2.000000 231.000000 4.000000
75% 1.095206e+18 4.298024e+08 NaN -34.581047 -58.392114 5.776200e+04 4.000000 36.000000 1.910000 12.000000 333.000000 13.000000
max 1.344330e+18 6.754917e+08 NaN -34.534980 -58.355403 1.050217e+08 1000.000000 992.000000 26.080000 222.000000 365.000000 340.000000

Analysis of Numerical and Categorical Variables

The numerical and categorical variables are further classified according to the scales of measurement proposed by S. S. Stevens:

No. Variable Type of Variable Subtype
0 id Numerical Discrete
1 name Categorical Nominal
2 host_id Numerical Discrete
3 host_name Categorical Nominal
4 neighbourhood_group Categorical Nominal
5 neighbourhood Categorical Nominal
6 latitude Numerical Continuous
7 longitude Numerical Continuous
8 room_type Categorical Nominal
9 price Numerical Continuous
10 minimum_nights Numerical Discrete
11 number_of_reviews Numerical Discrete
12 last_review Categorical Nominal
13 reviews_per_month Numerical Continuous
14 calculated_host_listings_count Numerical Discrete
15 availability_365 Numerical Discrete
16 number_of_reviews_ltm Numerical Discrete
17 license Categorical Nominal
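A first pass at a table like the one above can be derived from the pandas dtypes. The sketch below is illustrative only: it uses a tiny hypothetical frame (`sample_df`) instead of `listings.csv`, and the helper `first_pass_type` is an assumption, not part of the notebook. A dtype-based split is only a starting point, since identifier and date columns still call for manual judgement.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

# Hypothetical miniature frame standing in for listings_df.
sample_df = pd.DataFrame({
    "id": [11508, 14222],
    "room_type": ["Entire home/apt", "Private room"],
    "price": [67518.0, 22375.0],
    "minimum_nights": [3, 7],
})


def first_pass_type(series: pd.Series) -> tuple[str, str]:
    """Map a column's dtype to a (type, subtype) pair (hypothetical helper)."""
    if is_integer_dtype(series):
        return ("Numerical", "Discrete")
    if is_float_dtype(series):
        return ("Numerical", "Continuous")
    return ("Categorical", "Nominal")


# One row per column: name, type, subtype.
classification = pd.DataFrame(
    [(col, *first_pass_type(sample_df[col])) for col in sample_df.columns],
    columns=["Variable", "Type", "Subtype"],
)
print(classification)
```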
In [116]:
print("Column descriptions:")
column_descriptions = dict(zip(reference_df['Field'], reference_df['Description']))
for col in listings_df.columns:
    if col in column_descriptions:
        print(f"• {col}: {column_descriptions[col]}")
    else:
        print(f"• {col}: (No description available)")
Column descriptions:
• id: Airbnb's unique identifier for the listing
• name: nan
• host_id: nan
• host_name: nan
• neighbourhood_group: The neighbourhood group as geocoded using the latitude and longitude against neighborhoods as defined by open or public digital shapefiles.
• neighbourhood: The neighbourhood as geocoded using the latitude and longitude against neighborhoods as defined by open or public digital shapefiles.
• latitude: Uses the World Geodetic System (WGS84) projection for latitude and longitude.
• longitude: Uses the World Geodetic System (WGS84) projection for latitude and longitude.
• room_type: nan
• price: daily price in local currency. Note, $ sign may be used despite locale
• minimum_nights: minimum number of night stay for the listing (calendar rules may be different)
• number_of_reviews: The number of reviews the listing has
• last_review: The date of the last/newest review
• reviews_per_month: (No description available)
• calculated_host_listings_count: The number of listings the host has in the current scrape, in the city/region geography.
• availability_365: avaliability_x. The availability of the listing x days in the future as determined by the calendar. Note a listing may be available because it has been booked by a guest or blocked by the host.
• number_of_reviews_ltm: The number of reviews the listing has (in the last 12 months)
• license: nan
In [117]:
print("Analysis of the number of null values per column:")

# Null-value analysis per column
null_analysis = pd.DataFrame({
    'Columna': listings_df.columns,
    'Valores_Nulos': listings_df.isnull().sum(),
    'Porcentaje_Nulos': (listings_df.isnull().sum() / len(listings_df)) * 100
}).sort_values('Porcentaje_Nulos', ascending=False)

# Show the table restricted to columns that contain null values
display(null_analysis[null_analysis['Valores_Nulos'] > 0])

# Prepare the data for the charts
null_counts = listings_df.isnull().sum()
null_counts = null_counts[null_counts > 0].sort_values(ascending=False)

total_columns = listings_df.shape[1]
columns_with_nulls = (listings_df.isnull().sum() > 0).sum()
columns_without_nulls = total_columns - columns_with_nulls

# Create a figure with two side-by-side subplots
fig, axes = plt.subplots(1, 2, figsize=(18, 6))

# Bar chart
sns.barplot(
    x=null_counts.values.tolist(),
    y=null_counts.index.tolist(),
    ax=axes[0],
    palette="Reds_d"
)
axes[0].set_title('Number of Null Values per Column')
axes[0].set_xlabel('Number of Null Values')

# Pie chart
axes[1].pie(
    [columns_with_nulls, columns_without_nulls],
    labels=['With Null Values', 'Without Null Values'],
    autopct='%1.1f%%',
    colors=['#e74c3c', '#2ecc71'],
    startangle=140,
    explode=(0.05, 0)
)
axes[1].set_title('Percentage of Columns with/without Null Values')

plt.tight_layout()
plt.show()
Analysis of the number of null values per column:
Columna Valores_Nulos Porcentaje_Nulos
neighbourhood_group neighbourhood_group 35172 100.000000
license license 34782 98.891163
reviews_per_month reviews_per_month 5760 16.376663
last_review last_review 5760 16.376663
price price 3574 10.161492
host_name host_name 6 0.017059
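The table shows that `neighbourhood_group` is entirely empty and `license` is almost entirely empty. One way to flag such columns programmatically is sketched below. It is not a step the notebook performs: the frame `sample_df` is hypothetical and the 90% cut-off (`THRESHOLD`) is an assumption chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; the real analysis would use listings_df.
sample_df = pd.DataFrame({
    "neighbourhood_group": [np.nan, np.nan, np.nan, np.nan],
    "price": [67518.0, np.nan, 52511.0, 113360.0],
    "room_type": ["Entire home/apt"] * 4,
})

# Percentage of null values per column.
null_share = sample_df.isnull().mean() * 100

THRESHOLD = 90.0  # assumed cut-off, not taken from the notebook
mostly_empty = null_share[null_share > THRESHOLD].index.tolist()
print(mostly_empty)
```

Columns flagged this way are candidates for exclusion from later steps such as the correlation matrix, where an all-null column contributes only NaN entries.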

2. Perform an exploratory data analysis (EDA)

Detect general patterns, distributions, relationships between variables, and outliers.

In [118]:
duplicates = listings_df.duplicated().sum()
print(f"\nNumber of duplicate rows: {duplicates}")

print("\nConverting column 'last_review' from object to datetime\n")
df_clean = listings_df.copy()
df_clean['last_review'] = pd.to_datetime(df_clean['last_review'], errors='coerce')

print(df_clean.dtypes)
Number of duplicate rows: 0

Converting column 'last_review' from object to datetime

id                                         int64
name                                      object
host_id                                    int64
host_name                                 object
neighbourhood_group                      float64
neighbourhood                             object
latitude                                 float64
longitude                                float64
room_type                                 object
price                                    float64
minimum_nights                             int64
number_of_reviews                          int64
last_review                       datetime64[ns]
reviews_per_month                        float64
calculated_host_listings_count             int64
availability_365                           int64
number_of_reviews_ltm                      int64
license                                   object
dtype: object
In [119]:
# Analysis of numerical variables
print("Analysis of numerical variables\n")

numeric_cols = df_clean.select_dtypes(include=[np.number]).columns
print(f"Numerical variables: {list(numeric_cols)}")

# Descriptive statistics
print("\nDescriptive statistics:")
display(df_clean[numeric_cols].describe())

# Pairwise correlations between the numerical columns
corr_data = df_clean[numeric_cols].corr()

print("\nCorrelation matrix of numerical variables:")
plt.figure(figsize=(12, 10))
sns.heatmap(corr_data, annot=True, cmap='coolwarm', center=0, 
           square=True, linewidths=0.5)
plt.tight_layout()
plt.show()

price_corr = corr_data['price'].abs().sort_values(ascending=False)
Analysis of numerical variables

Numerical variables: ['id', 'host_id', 'neighbourhood_group', 'latitude', 'longitude', 'price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365', 'number_of_reviews_ltm']

Descriptive statistics:
id host_id neighbourhood_group latitude longitude price minimum_nights number_of_reviews reviews_per_month calculated_host_listings_count availability_365 number_of_reviews_ltm
count 3.517200e+04 3.517200e+04 0.0 35172.000000 35172.000000 3.159800e+04 35172.000000 35172.000000 29412.000000 35172.000000 35172.000000 35172.000000
mean 7.117880e+17 2.143294e+08 NaN -34.591554 -58.417289 9.548776e+04 6.159871 28.027579 1.325920 15.646878 206.609320 9.500625
std 4.840974e+17 2.027420e+08 NaN 0.018257 0.030169 1.402656e+06 26.072002 45.151666 1.336446 34.108688 126.189102 13.991379
min 1.150800e+04 1.342600e+04 NaN -34.693700 -58.530890 2.600000e+02 1.000000 0.000000 0.010000 1.000000 0.000000 0.000000
25% 4.719776e+07 3.070016e+07 NaN -34.602670 -58.437050 2.971100e+04 1.000000 2.000000 0.360000 1.000000 88.000000 0.000000
50% 8.875662e+17 1.421004e+08 NaN -34.590890 -58.418914 3.990800e+04 2.000000 11.000000 0.930000 2.000000 231.000000 4.000000
75% 1.095206e+18 4.298024e+08 NaN -34.581047 -58.392114 5.776200e+04 4.000000 36.000000 1.910000 12.000000 333.000000 13.000000
max 1.344330e+18 6.754917e+08 NaN -34.534980 -58.355403 1.050217e+08 1000.000000 992.000000 26.080000 222.000000 365.000000 340.000000
Correlation matrix of numerical variables:
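The cell above computes `price_corr` but never displays it, and the correlation input still contains the all-null `neighbourhood_group` column and the identifier columns. A minimal sketch of how the ranking could be inspected follows. It assumes a small hypothetical frame (`sample_df`) in place of `df_clean`, and dropping identifiers is an illustrative choice, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for df_clean.
sample_df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "neighbourhood_group": [np.nan] * 5,
    "price": [100.0, 200.0, 300.0, 400.0, 500.0],
    "minimum_nights": [1, 2, 3, 4, 5],
    "availability_365": [300, 10, 250, 40, 120],
})

numeric = sample_df.select_dtypes(include=[np.number])
numeric = numeric.dropna(axis=1, how="all")  # drop columns with no data at all
numeric = numeric.drop(columns=["id"])       # identifiers carry no magnitude

# Absolute Pearson correlation of every remaining column with price.
price_corr = (
    numeric.corr()["price"]
    .drop("price")                           # self-correlation is always 1
    .abs()
    .sort_values(ascending=False)
)
print(price_corr)
```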
In [120]:
print("Price analysis:\n")

# Keep valid prices only (non-null and greater than 0)
valid_prices = df_clean['price'].dropna()
valid_prices = valid_prices[valid_prices > 0]

print(f"Mean price: ${valid_prices.mean():,.2f}")
print(f"Median price: ${valid_prices.median():,.2f}")
print(f"Minimum price: ${valid_prices.min():,.2f}")
print(f"Maximum price: ${valid_prices.max():,.2f}")
print(f"Standard deviation: ${valid_prices.std():,.2f}")

plt.figure(figsize=(15, 10))

# Panel 1: histogram of all valid prices
plt.subplot(2, 2, 1)
plt.hist(valid_prices.values, bins=50, alpha=0.7, color='skyblue', edgecolor='black')
plt.title('Price Distribution')
plt.xlabel('Price ($)')
plt.ylabel('Frequency')

# Panel 2: boxplot of all valid prices
plt.subplot(2, 2, 2)
plt.boxplot(valid_prices.values)
plt.title('Price Boxplot')
plt.ylabel('Price ($)')

# Panel 3: histogram without the top 5% of prices
plt.subplot(2, 2, 3)
q95 = valid_prices.quantile(0.95)
filtered_prices = valid_prices[valid_prices <= q95]
plt.hist(filtered_prices.values, bins=50, alpha=0.7, color='lightgreen', edgecolor='black')
plt.title('Price Distribution (excluding top 5%)')
plt.xlabel('Price ($)')
plt.ylabel('Frequency')

# Panel 4: prices by room type
plt.subplot(2, 2, 4)
if 'room_type' in df_clean.columns:
    df_price_room = df_clean[df_clean['price'].notna() & (df_clean['price'] > 0)]
    sns.boxplot(data=df_price_room, x='room_type', y='price')
    plt.title('Prices by Room Type')
    plt.xticks(rotation=45)

plt.tight_layout()
plt.show()
Price analysis:

Mean price: $95,487.76
Median price: $39,908.00
Minimum price: $260.00
Maximum price: $105,021,704.00
Standard deviation: $1,402,656.42
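The gap between the minimum (260) and the maximum (105,021,704) spans more than five orders of magnitude, so a linear histogram is dominated by a handful of extreme listings. Besides trimming the top 5%, a logarithmic axis is a common alternative. The sketch below only illustrates the idea with the three summary values printed above. It is an assumption about how the chart could be built, not what the notebook does.

```python
import numpy as np

# Summary values taken from the printed output (minimum, median, maximum).
prices = np.array([260.0, 39_908.0, 105_021_704.0])

# On a log10 scale the three values become comparable in magnitude.
log_prices = np.log10(prices)
print(np.round(log_prices, 2))

# 51 logarithmically spaced edges define 50 bins, usable for instance as
# plt.hist(valid_prices, bins=bins) followed by plt.xscale('log').
bins = np.logspace(np.log10(prices.min()), np.log10(prices.max()), num=51)
```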
In [121]:
print("Outlier analysis:")

def detect_outliers_iqr(data, column):
    """Return the rows outside the 1.5 * IQR fences, plus both fences."""
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    return outliers, lower_bound, upper_bound


key_numeric_cols = ['price', 'minimum_nights', 'number_of_reviews', 'availability_365']
key_numeric_cols = [col for col in key_numeric_cols if col in df_clean.columns]

for col in key_numeric_cols:
    if df_clean[col].notna().any():
        clean_data = df_clean.dropna(subset=[col])
        if col == 'price':
            clean_data = clean_data[clean_data[col] > 0]  # Keep valid prices only
        
        outliers, lower, upper = detect_outliers_iqr(clean_data, col)
        
        print(f"\n{col.upper()}:")
        print(f"  Normal range: {lower:.2f} - {upper:.2f}")
        print(f"  Outliers detected: {len(outliers)} ({len(outliers)/len(clean_data)*100:.1f}%)")
        
        if len(outliers) > 0:
            print(f"  Extreme values: {clean_data[col].min():.2f} - {clean_data[col].max():.2f}")

# Distribution of the number of listings per host
host_data = df_clean['calculated_host_listings_count'].dropna()

plt.figure(figsize=(15, 5))

plt.subplot(1, 2, 1)

plt.hist(host_data.values, bins=30, alpha=0.7, color='purple', edgecolor='black')
plt.title('Distribution of Listings per Host')
plt.xlabel('Number of Listings per Host')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)

# Same histogram without the top 5% of hosts
filtered_host_data = host_data[host_data <= host_data.quantile(0.95)]

plt.hist(filtered_host_data.values, bins=20, alpha=0.7, color='mediumpurple', edgecolor='black')
plt.title('Distribution of Listings per Host (excluding top 5%)')
plt.xlabel('Number of Listings per Host')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()
Outlier analysis:

PRICE:
  Normal range: -12365.50 - 99838.50
  Outliers detected: 2693 (8.5%)
  Extreme values: 260.00 - 105021704.00

MINIMUM_NIGHTS:
  Normal range: -3.50 - 8.50
  Outliers detected: 2950 (8.4%)
  Extreme values: 1.00 - 1000.00

NUMBER_OF_REVIEWS:
  Normal range: -49.00 - 87.00
  Outliers detected: 2709 (7.7%)
  Extreme values: 0.00 - 992.00

AVAILABILITY_365:
  Normal range: -279.50 - 700.50
  Outliers detected: 0 (0.0%)
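The fences reported for `price` can be checked by hand from the quartiles that `describe()` printed earlier (25% = 29,711 and 75% = 57,762). The short worked example below reproduces that arithmetic. Because the lower fence is negative and prices cannot be, only the upper fence ever flags a listing for this column.

```python
# Quartiles of `price` as reported by describe() earlier in the notebook.
q1 = 29_711.0
q3 = 57_762.0

iqr = q3 - q1                   # interquartile range
lower_bound = q1 - 1.5 * iqr    # lower fence
upper_bound = q3 + 1.5 * iqr    # upper fence

print(f"IQR: {iqr:.2f}")
print(f"Normal range: {lower_bound:.2f} - {upper_bound:.2f}")
```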
In [122]:
print("Top 10 hosts with the most listings for rent:\n")
sns.set(style="whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

top_hosts = df_clean['host_id'].value_counts().head(10)
plt.figure(figsize=(10, 6))
plt.barh(top_hosts.index.astype(str), top_hosts.values, color='gray')
plt.xlabel("Number of Listings")
plt.ylabel("Host ID")
plt.title("Top 10 Hosts with the Most Listings")
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
Top 10 hosts with the most listings for rent:

In [123]:
print("Distribution of room types:\n")
room_counts = df_clean['room_type'].value_counts()
total = room_counts.sum()
percentages = (room_counts / total * 100).round(1)
plt.figure(figsize=(10, 6))
sns.barplot(x=room_counts.values, y=room_counts.index, palette='Set2')

# Annotate each bar with its count and share of the total
for i, (count, pct) in enumerate(zip(room_counts.values, percentages.values)):
    label = f"{count} ({pct}%)"
    plt.text(count + total * 0.01, i, label, va='center', fontweight='bold')

plt.xlabel('Number of Listings')
plt.ylabel('Room Type')
plt.tight_layout()
plt.show()
Distribution of room types:

In [124]:
print(f"Total unique neighbourhoods: {df_clean['neighbourhood'].nunique()}\n")

top_neighbourhoods = df_clean['neighbourhood'].value_counts().head(15)

print("Top 15 neighbourhoods with the most listings:\n")
plt.figure(figsize=(15, 8))

sns.barplot(x=top_neighbourhoods.values, y=top_neighbourhoods.index.tolist())
plt.xlabel('Number of Listings')
plt.ylabel('Neighbourhood')

# Annotate each bar with its count
for i, v in enumerate(top_neighbourhoods.values):
    plt.text(v + 10, i, str(v), va='center')

plt.tight_layout()
plt.show()
Total unique neighbourhoods: 48

Top 15 neighbourhoods with the most listings:

In [125]:
print("Review analysis\n")

reviews_data = df_clean['number_of_reviews'].dropna()

print(f"Mean reviews per listing: {reviews_data.mean():.2f}")
print(f"Median reviews: {reviews_data.median():.2f}")
print(f"Listings without reviews: {(reviews_data == 0).sum()} ({(reviews_data == 0).mean()*100:.1f}%)")

plt.figure(figsize=(15, 5))


plt.subplot(1, 2, 1)

plt.hist(reviews_data.values, bins=50, alpha=0.7, color='orange', edgecolor='black')
plt.title('Distribution of the Number of Reviews')
plt.xlabel('Number of Reviews')
plt.ylabel('Frequency')


plt.tight_layout()
plt.show()
Review analysis

Mean reviews per listing: 28.03
Median reviews: 11.00
Listings without reviews: 5760 (16.4%)
In [126]:
print("Geographic analysis\n")

geo_data = df_clean.dropna(subset=['latitude', 'longitude'])

print(f"Latitude range: {geo_data['latitude'].min():.4f} to {geo_data['latitude'].max():.4f}")
print(f"Longitude range: {geo_data['longitude'].min():.4f} to {geo_data['longitude'].max():.4f}")

geo_price_data = geo_data.dropna(subset=['price'])
geo_price_data = geo_price_data[geo_price_data['price'] > 0]

# Keep only prices between the 5th and 95th percentiles
p5 = geo_price_data['price'].quantile(0.05)
p95 = geo_price_data['price'].quantile(0.95)
geo_price_data = geo_price_data[(geo_price_data['price'] >= p5) & (geo_price_data['price'] <= p95)]

# Center the map on the mean coordinates
center_lat = geo_price_data['latitude'].mean()
center_lon = geo_price_data['longitude'].mean()

mapa = folium.Map(
    location=[center_lat, center_lon], 
    zoom_start=12,
    tiles='CartoDB positron'
)

# Min-max normalize prices to the range [0, 1]
precios_norm = (geo_price_data['price'] - geo_price_data['price'].min()) / (geo_price_data['price'].max() - geo_price_data['price'].min())

# Build the list of points [lat, lon, intensity]
heat_data = []
for idx, row in geo_price_data.iterrows():
    # Use the normalized price as the intensity
    intensidad = precios_norm[idx]
    heat_data.append([row['latitude'], row['longitude'], intensidad])

# Add the heat map layer
HeatMap(
    heat_data,
    min_opacity=0.3,        # Minimum opacity, raised for better visibility
    max_zoom=18,           # Maximum zoom
    radius=20,             # Radius of influence, increased
    blur=15,               # Blur effect, increased
    gradient={             # Color gradient
        0.0: 'darkblue',   # Lowest prices (P5)
        0.2: 'blue',       
        0.4: 'cyan',
        0.6: 'lime',
        0.8: 'yellow',
        0.9: 'orange',
        1.0: 'red'         # Highest prices (P95)
    }
).add_to(mapa)

legend_html = '''
<div style="position: fixed; 
     bottom: 50px; left: 50px; width: 240px; height: 180px; 
     background-color: white; border:3px solid #333; z-index:9999; 
     font-size:14px; padding: 15px; border-radius: 10px; box-shadow: 0 4px 8px rgba(0,0,0,0.3);">
<p style="margin: 0 0 10px 0; font-weight: bold; font-size: 16px;">🔥 Heat Map - Prices</p>
<p style="margin: 0 0 8px 0; font-size: 12px; color: #666;">Filtered between P5 and P95</p>
<div style="background: linear-gradient(to right, darkblue, blue, cyan, lime, yellow, orange, red); height: 20px; width: 100%; margin: 10px 0; border-radius: 5px;"></div>
<div style="display: flex; justify-content: space-between; font-size: 11px; margin-bottom: 8px;">
    <span>Low</span>
    <span>Medium</span>
    <span>High</span>
</div>
<p style="margin: 5px 0; font-size: 12px;">
    <b>Properties:</b> ''' + str(len(geo_price_data)) + '''
</p>
</div>
'''

mapa.get_root().html.add_child(folium.Element(legend_html))


display(mapa)
Geographic analysis

Latitude range: -34.6937 to -34.5350
Longitude range: -58.5309 to -58.3554
[Interactive folium map: price heat map]
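The heat map cell builds its `[lat, lon, intensity]` list row by row with `iterrows()`. A vectorized construction gives the same structure and tends to scale better with tens of thousands of rows. The sketch below assumes a three-row hypothetical frame (`sample`) with round prices in place of `geo_price_data`. It is an alternative formulation, not the notebook's code.

```python
import pandas as pd

# Hypothetical rows standing in for geo_price_data.
sample = pd.DataFrame({
    "latitude": [-34.5818, -34.5862, -34.6144],
    "longitude": [-58.4242, -58.4104, -58.3761],
    "price": [20_000.0, 60_000.0, 120_000.0],
})

# Min-max scaling of price to the range [0, 1].
price_min, price_max = sample["price"].min(), sample["price"].max()
intensity = (sample["price"] - price_min) / (price_max - price_min)

# Same [lat, lon, intensity] structure as the loop, built without iterrows().
heat_data = (
    sample[["latitude", "longitude"]]
    .assign(intensity=intensity)
    .to_numpy()
    .tolist()
)
print(heat_data)
```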
In [127]:
geo_price_data = df_clean.dropna(subset=['latitude', 'longitude', 'price'])
geo_price_data = geo_price_data[geo_price_data['price'] > 0]

p95 = geo_price_data['price'].quantile(0.95)
p05 = geo_price_data['price'].quantile(0.05)
geo_price_data = geo_price_data[(geo_price_data['price'] <= p95) & (geo_price_data['price'] >= p05)]

center_lat = geo_price_data['latitude'].mean()
center_lon = geo_price_data['longitude'].mean()

mapa = folium.Map(location=[center_lat, center_lon], zoom_start=12, tiles='CartoDB positron')

norm = colors.Normalize(vmin=geo_price_data['price'].min(), vmax=geo_price_data['price'].max())
cmap = plt.get_cmap('YlOrRd')  # matplotlib.cm.get_cmap is deprecated and removed in recent Matplotlib releases

sample_data = geo_price_data.sample(min(500, len(geo_price_data)))

def generar_tooltip(row):
    """Build the HTML tooltip shown when hovering over a listing."""
    return (
        f"<div style='min-width: 220px;'>"
        f"<b>Name:</b> <span style='font-weight:bold'>{row['name']}</span><br>"
        f"<b>Price:</b> <span style='font-weight:bold'>${row['price']:,.0f}</span><br>"
        f"<b>Neighbourhood:</b> <span style='font-weight:bold'>{row['neighbourhood']}</span><br>"
        f"<b>Type:</b> <span style='font-weight:bold'>{row['room_type']}</span>"
        f"</div>"
    )

for _, row in sample_data.iterrows():
    color = colors.to_hex(cmap(norm(row['price'])))
    folium.CircleMarker(
        location=[row['latitude'], row['longitude']],
        radius=5,
        color=color,
        fill=True,
        fillColor=color,
        fill_opacity=0.8,
        tooltip=folium.Tooltip(generar_tooltip(row), sticky=True, direction='top')
    ).add_to(mapa)

# Marker for the minimum price
min_row = geo_price_data.loc[geo_price_data['price'].idxmin()]
folium.Marker(
    location=[min_row['latitude'], min_row['longitude']],
    icon=folium.Icon(color='green', icon='arrow-down', prefix='fa'),
    tooltip=folium.Tooltip(generar_tooltip(min_row), sticky=True, direction='top')
).add_to(mapa)

# Marker for the maximum price
max_row = geo_price_data.loc[geo_price_data['price'].idxmax()]
folium.Marker(
    location=[max_row['latitude'], max_row['longitude']],
    icon=folium.Icon(color='red', icon='arrow-up', prefix='fa'),
    tooltip=folium.Tooltip(generar_tooltip(max_row), sticky=True, direction='top')
).add_to(mapa)

min_price = int(geo_price_data['price'].min())
max_price = int(geo_price_data['price'].max())

legend_html = f'''
<div style="position: fixed; 
     bottom: 50px; left: 50px; width: 240px; height: 130px; 
     background-color: white; border:3px solid #333; z-index:9999; 
     font-size:14px; padding: 10px 15px; border-radius: 10px; box-shadow: 0 4px 8px rgba(0,0,0,0.3);">
<p style="margin: 0 0 8px 0; font-weight: bold; font-size: 15px;">💰 Price per Night Scale</p>
<p style="margin: 0 0 8px 0; font-size: 12px; color: #666;">Filtered between P5 and P95</p>
<div style="display: flex; align-items: center;">
    <span style="flex: 1; text-align: left;">${min_price:,.0f}</span>
    <div style="flex: 6; height: 15px; background: linear-gradient(to right, #ffffcc, #ffeda0, #feb24c, #f03b20); margin: 0 10px;"></div>
    <span style="flex: 1; text-align: right;">${max_price:,.0f}</span>
</div>
</div>
'''
mapa.get_root().html.add_child(folium.Element(legend_html))

display(mapa)
[Interactive folium map: sampled listings colored by price]

3. Formulate interesting questions based on the dataset.

The questions that arise from the exploratory analysis are:

  1. Which neighbourhoods are the most and least expensive?
  2. Is there a relationship between the number of reviews and price?
  3. Which neighbourhoods have grown the most in the last year?
  4. What is the relationship between availability and price?

1. Which neighbourhoods are the most and least expensive?

In [132]:
price_neighbourhood = df_clean.dropna(subset=['price', 'neighbourhood'])
price_neighbourhood = price_neighbourhood[price_neighbourhood['price'] > 0]


p5 = price_neighbourhood['price'].quantile(0.05)
p95 = price_neighbourhood['price'].quantile(0.95)
print(f" Filtrado de precios por barrio:")
print(f"Percentil 5 (mΓ­nimo): ${p5:.2f}")
print(f"Percentil 95 (mΓ‘ximo): ${p95:.2f}")
print(f"Propiedades antes del filtro: {len(price_neighbourhood):,}")
price_neighbourhood = price_neighbourhood[(price_neighbourhood['price'] >= p5) & (price_neighbourhood['price'] <= p95)]
print(f"Propiedades despuΓ©s del filtro: {len(price_neighbourhood):,}")

neighbourhood_stats = price_neighbourhood.groupby('neighbourhood').agg({
    'price': ['mean', 'median', 'count']
}).round(2)

neighbourhood_stats.columns = ['precio_promedio', 'precio_mediano', 'cantidad_listados']
neighbourhood_stats = neighbourhood_stats[neighbourhood_stats['cantidad_listados'] >= 5]
neighbourhood_stats = neighbourhood_stats.sort_values('precio_promedio', ascending=False)

print("\nTOP 10 BARRIOS MÁS CAROS (P5-P95):")
print(neighbourhood_stats.head(10))

print("\nTOP 10 BARRIOS MÁS BARATOS (P5-P95):")
print(neighbourhood_stats.tail(10))


print(f"\n ESTADÍSTICAS GENERALES (P5-P95):")
print(f"Barrios analizados: {len(neighbourhood_stats)}")
print(f"Precio promedio general: ${price_neighbourhood['price'].mean():.2f}")
print(f"Precio mediano general: ${price_neighbourhood['price'].median():.2f}")
print(f"Rango de precios: ${p5:.2f} - ${p95:.2f}")
 Filtrado de precios por barrio:
Percentil 5 (mΓ­nimo): $19950.25
Percentil 95 (mΓ‘ximo): $126786.10
Propiedades antes del filtro: 31,598
Propiedades despuΓ©s del filtro: 28,438

TOP 10 BARRIOS MÁS CAROS (P5-P95):
                 precio_promedio  precio_mediano  cantidad_listados
neighbourhood                                                      
Puerto Madero           85735.71         86044.5                244
Colegiales              50135.78         42009.0                639
Palermo                 50003.62         42775.0               9515
Recoleta                48438.58         42009.0               4192
NuΓ±ez                   48241.84         41615.0                600
Belgrano                47386.56         42009.0               1487
Retiro                  47053.62         39761.0               1396
Barracas                46651.39         42009.0                201
Villa Devoto            45521.31         39908.0                 85
Velez Sarsfield         45320.31         33008.0                 13

TOP 10 BARRIOS MÁS BARATOS (P5-P95):
                   precio_promedio  precio_mediano  cantidad_listados
neighbourhood                                                        
Agronomia                 37294.63         36758.0                 30
Villa Ortuzar             36582.41         31507.0                 93
Coghlan                   36427.05         32105.0                 95
Villa Del Parque          36359.93         31507.0                 94
San Cristobal             36138.82         31801.0                148
Villa Gral. Mitre         35362.00         32557.0                 25
Boedo                     35285.74         31086.0                 78
Liniers                   34866.81         30456.0                 21
Parque Avellaneda         33314.08         30535.0                 12
Nueva Pompeya             28458.00         24365.0                  5

 ESTADÍSTICAS GENERALES (P5-P95):
Barrios analizados: 44
Precio promedio general: $46361.65
Precio mediano general: $39908.00
Rango de precios: $19950.25 - $126786.10
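The filter-then-aggregate step behind the tables above can be sketched on toy data. This is a minimal illustration, not the notebook's actual cell: the helper names `filter_percentile_range` and `neighbourhood_price_stats` and the toy values are assumptions; only the column names and the minimum of 5 listings per neighbourhood come from the code above.

```python
import pandas as pd


def filter_percentile_range(df, column, lower=0.05, upper=0.95):
    """Keep rows whose `column` value lies between the two quantiles (inclusive)."""
    lo, hi = df[column].quantile([lower, upper])
    return df[(df[column] >= lo) & (df[column] <= hi)]


def neighbourhood_price_stats(df, min_listings=5):
    """Mean, median and listing count of price per neighbourhood, most expensive first."""
    stats = df.groupby('neighbourhood')['price'].agg(
        precio_promedio='mean',
        precio_mediano='median',
        cantidad_listados='count',
    )
    # Drop neighbourhoods with too few listings for a stable average
    stats = stats[stats['cantidad_listados'] >= min_listings]
    return stats.sort_values('precio_promedio', ascending=False)


# Hypothetical toy data: two extreme prices and one tiny neighbourhood
toy = pd.DataFrame({
    'neighbourhood': ['A'] * 6 + ['B'] * 6 + ['C'] * 2,
    'price': [100, 110, 120, 130, 140, 10_000, 50, 55, 60, 65, 70, 1, 500, 510],
})

toy_filtered = filter_percentile_range(toy, 'price')
toy_stats = neighbourhood_price_stats(toy_filtered, min_listings=5)
print(toy_stats)
```

On this toy input the percentile filter removes the two extreme prices, and the listing threshold then drops neighbourhood C, which has only two listings.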
In [129]:
plt.figure(figsize=(15, 12))


plt.subplot(2, 1, 1)
top_expensive = neighbourhood_stats.head(10)

sns.barplot(x=top_expensive['precio_promedio'].values, y=top_expensive.index.tolist())
plt.title('Top 10 Barrios Más Caros (Precio Promedio)')
plt.xlabel('Precio Promedio ($)')


plt.subplot(2, 1, 2)
top_cheap = neighbourhood_stats.tail(10)

sns.barplot(x=top_cheap['precio_promedio'].values, y=top_cheap.index.tolist())
plt.title('Top 10 Barrios Más Baratos (Precio Promedio)')
plt.xlabel('Precio Promedio ($)')

plt.tight_layout()
plt.show()
In [130]:
plt.figure(figsize=(15, 14))


plt.subplot(2, 1, 1)
top_expensive_names = neighbourhood_stats.head(10).index.tolist()
top_expensive_data = price_neighbourhood[price_neighbourhood['neighbourhood'].isin(top_expensive_names)]

order_expensive = neighbourhood_stats.head(10).index.tolist()
sns.boxplot(data=top_expensive_data, x='price', y='neighbourhood', 
           order=order_expensive, orient='h')
plt.title('Top 10 Barrios Más Caros - Distribución de Precios (P5-P95)', fontsize=14, fontweight='bold')
plt.xlabel('Precio ($)')
plt.ylabel('Barrio')


plt.axvline(price_neighbourhood['price'].mean(), color='blue', linestyle='--', alpha=0.7, label='Promedio General')
plt.axvline(price_neighbourhood['price'].median(), color='green', linestyle='--', alpha=0.7, label='Mediana General')
plt.legend()


plt.subplot(2, 1, 2)
top_cheap_names = neighbourhood_stats.tail(10).index.tolist()
top_cheap_data = price_neighbourhood[price_neighbourhood['neighbourhood'].isin(top_cheap_names)]


order_cheap = neighbourhood_stats.tail(10).sort_values('precio_promedio', ascending=True).index.tolist()
sns.boxplot(data=top_cheap_data, x='price', y='neighbourhood', 
           order=order_cheap, orient='h')
plt.title('Top 10 Barrios Más Baratos - Distribución de Precios (P5-P95)', fontsize=14, fontweight='bold')
plt.xlabel('Precio ($)')
plt.ylabel('Barrio')


plt.axvline(price_neighbourhood['price'].mean(), color='blue', linestyle='--', alpha=0.7, label='Promedio General')
plt.axvline(price_neighbourhood['price'].median(), color='green', linestyle='--', alpha=0.7, label='Mediana General')
plt.legend()

plt.tight_layout()
plt.show()
In [166]:
# Reference lines: 5th and 95th percentiles of the prices in price_neighbourhood
p5 = price_neighbourhood['price'].quantile(0.05)
p95 = price_neighbourhood['price'].quantile(0.95)

# sns.histplot replaces the deprecated sns.distplot and plots counts, matching the y-axis label.
# Draw the histogram first so plt.ylim() reflects the bar heights when placing the text labels.
sns.histplot(palermo_filtered['price'], bins=30, kde=True, color='gray')
plt.axvline(p5, color='red', linestyle='--', linewidth=2, label='P5')
plt.axvline(p95, color='red', linestyle='--', linewidth=2, label='P95')
plt.text(p5, plt.ylim()[1]*0.8, 'P5', color='red', ha='left', fontsize=14, fontweight='bold')
plt.text(p95, plt.ylim()[1]*0.8, 'P95', color='red', ha='left', fontsize=14, fontweight='bold')
plt.title('Histograma de precios en Palermo (P5–P95)')
plt.xlabel('Precio ($)')
plt.ylabel('Cantidad de alojamientos')
plt.tight_layout()
plt.show()

2. Is there a relationship between the number of reviews and the price?

In [154]:
review_price = df_clean.dropna(subset=['price', 'number_of_reviews'])
review_price = review_price[review_price['price'] > 0]

correlation = review_price['price'].corr(review_price['number_of_reviews'])

review_price['review_category'] = pd.cut(
    review_price['number_of_reviews'], 
    bins=[-1, 0, 10, 50, 100, float('inf')],  # lowest edge is -1 so listings with 0 reviews fall in the first bin
    labels=['Sin reseñas', '1-10 reseñas', '11-50 reseñas', '51-100 reseñas', '100+ reseñas']
)

price_by_reviews = review_price.groupby('review_category')['price'].agg(['mean', 'median', 'count'])
print("\n Precio promedio por categoría de reseñas:")
print(price_by_reviews)

print("\n Correlación:")
print(correlation)
 Precio promedio por categoría de reseñas:
                          mean   median  count
review_category                               
Sin reseñas      240021.189201  42009.0   4445
1-10 reseñas      87150.466342  40958.0  10280
11-50 reseñas     66266.299070  38259.0  10964
51-100 reseñas    56617.685471  38967.5   3882
100+ reseñas      53324.053774  39310.0   2027

 Correlación:
-0.01682546608709121
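The group means above sit far above the medians, which suggests a heavily skewed price distribution, and Pearson's r only measures linear association. A rank-based (Spearman) coefficient is a common complement in that situation. The sketch below uses synthetic data, not the Airbnb listings, and the helper name `spearman_corr` is an assumption; it computes Spearman's coefficient as the Pearson correlation of the ranks, so it needs only pandas.

```python
import numpy as np
import pandas as pd


def spearman_corr(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return x.rank().corr(y.rank())


# Synthetic example: price falls monotonically, but not linearly, with reviews
rng = np.random.default_rng(0)
reviews = pd.Series(rng.integers(0, 200, size=500)).astype(float)
price = 100_000 * np.exp(-reviews / 20)

pearson = reviews.corr(price)            # understates a curved relationship
spearman = spearman_corr(reviews, price)  # captures any monotone relationship

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Applied to the notebook's data, the same helper could be called as `spearman_corr(review_price['number_of_reviews'], review_price['price'])`; whether it changes the conclusion is something only the real data can show.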
In [136]:
plt.figure(figsize=(16, 12))

plt.subplot(2, 2, 1)
plt.scatter(review_price['number_of_reviews'].values, review_price['price'].values, alpha=0.5, s=30)
plt.xlabel('Número de Reseñas')
plt.ylabel('Precio ($)')
plt.title(f'Relación Precio vs Reseñas (r={correlation:.3f})', fontweight='bold')
plt.grid(True, alpha=0.3)

plt.subplot(2, 2, 2)
sns.boxplot(data=review_price, x='review_category', y='price', palette='Set2')
plt.xticks(rotation=45)
plt.title('Distribución de Precios por Categoría de Reseñas', fontweight='bold')
plt.ylabel('Precio ($)')

plt.subplot(2, 2, 3)
q99_price = review_price['price'].quantile(0.99)
q99_reviews = review_price['number_of_reviews'].quantile(0.99)
filtered_data = review_price[
    (review_price['price'] <= q99_price) & 
    (review_price['number_of_reviews'] <= q99_reviews)
]

filtered_correlation = filtered_data['price'].corr(filtered_data['number_of_reviews'])

plt.scatter(filtered_data['number_of_reviews'].values, filtered_data['price'].values, 
           alpha=0.6, color='green', s=25)
plt.xlabel('Número de Reseñas')
plt.ylabel('Precio ($)')
plt.title(f'Sin Outliers Extremos (r={filtered_correlation:.3f})', fontweight='bold')
plt.grid(True, alpha=0.3)

plt.subplot(2, 2, 4)
avg_prices = price_by_reviews['mean']
# Named bar_colors so it does not shadow the matplotlib.colors module imported as `colors`
bar_colors = ['lightcoral', 'lightblue', 'lightgreen', 'gold', 'plum']
bars = plt.bar(range(len(avg_prices)), avg_prices.values, color=bar_colors)

for i, (bar, value) in enumerate(zip(bars, avg_prices.values)):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + value*0.01, 
             f'${value:.0f}', ha='center', va='bottom', fontweight='bold')

plt.xticks(range(len(avg_prices)), avg_prices.index.tolist(), rotation=45)
plt.title('Precio Promedio por Categoría de Reseñas', fontweight='bold')
plt.ylabel('Precio Promedio ($)')

plt.tight_layout()
plt.show()
In [137]:
# Additional information
print(f"\n ANÁLISIS DETALLADO:")
print(f"Correlación completa: {correlation:.4f}")
print(f"Correlación sin outliers: {filtered_correlation:.4f}")
print(f"Total de propiedades analizadas: {len(review_price):,}")
print(f"Propiedades sin outliers: {len(filtered_data):,}")

print(f"\n INSIGHTS POR CATEGORÍA:")
for category in price_by_reviews.index:
    mean_price = price_by_reviews.loc[category, 'mean']
    count = price_by_reviews.loc[category, 'count']
    median_price = price_by_reviews.loc[category, 'median']
    print(f"{category}:")
    print(f"  • Precio promedio: ${mean_price:.2f}")
    print(f"  • Precio mediano: ${median_price:.2f}")
    print(f"  • Cantidad de listados: {count:,}")
    print(f"  • Porcentaje del total: {(count/len(review_price)*100):.1f}%")
 ANÁLISIS DETALLADO:
Correlación completa: -0.0168
Correlación sin outliers: -0.0206
Total de propiedades analizadas: 31,598
Propiedades sin outliers: 30,968

 INSIGHTS POR CATEGORÍA:
Sin reseñas:
  • Precio promedio: $240021.19
  • Precio mediano: $42009.00
  • Cantidad de listados: 4,445
  • Porcentaje del total: 14.1%
1-10 reseñas:
  • Precio promedio: $87150.47
  • Precio mediano: $40958.00
  • Cantidad de listados: 10,280
  • Porcentaje del total: 32.5%
11-50 reseñas:
  • Precio promedio: $66266.30
  • Precio mediano: $38259.00
  • Cantidad de listados: 10,964
  • Porcentaje del total: 34.7%
51-100 reseñas:
  • Precio promedio: $56617.69
  • Precio mediano: $38967.50
  • Cantidad de listados: 3,882
  • Porcentaje del total: 12.3%
100+ reseñas:
  • Precio promedio: $53324.05
  • Precio mediano: $39310.00
  • Cantidad de listados: 2,027
  • Porcentaje del total: 6.4%
In [155]:
# Compute the 5th and 95th percentiles
p5_price, p95_price = price_neighbourhood['price'].quantile([0.05, 0.95])
p5_reviews, p95_reviews = price_neighbourhood['number_of_reviews'].quantile([0.05, 0.95])

# Filter by price (P5-P95)
filtered_price = price_neighbourhood[
    (price_neighbourhood['price'] >= p5_price) & (price_neighbourhood['price'] <= p95_price)
]

# Filter by price and number of reviews (P5-P95)
filtered_price_reviews = price_neighbourhood[
    (price_neighbourhood['price'] >= p5_price) & (price_neighbourhood['price'] <= p95_price) &
    (price_neighbourhood['number_of_reviews'] >= p5_reviews) & (price_neighbourhood['number_of_reviews'] <= p95_reviews)
]

# Create the subplots with a larger figure size
fig, axs = plt.subplots(3, 1, figsize=(12, 20))

# Shared, larger font sizes
title_font = 16
label_font = 14
tick_font = 12

# Original plot
sns.scatterplot(
    data=price_neighbourhood,
    x='number_of_reviews',
    y='price',
    alpha=0.4,
    s=50,  # marker size
    ax=axs[0]
)
axs[0].set_title('Relación Número de Reseñas y Precio (Original)', fontsize=title_font)
axs[0].set_xlabel('Número de Reseñas', fontsize=label_font)
axs[0].set_ylabel('Precio ($)', fontsize=label_font)
axs[0].tick_params(labelsize=tick_font)

# Filtered plot (P5-P95 on both price and reviews)
sns.scatterplot(
    data=filtered_price_reviews,
    x='number_of_reviews',
    y='price',
    alpha=0.4,
    color='green',
    s=50,
    ax=axs[1]
)
axs[1].set_title('Relación Reseñas y Precio (P5-P95 en Precio y Reseñas)', fontsize=title_font)
axs[1].set_xlabel('Número de Reseñas', fontsize=label_font)
axs[1].set_ylabel('Precio ($)', fontsize=label_font)
axs[1].tick_params(labelsize=tick_font)

# Regression plot
sns.regplot(
    data=filtered_price_reviews,
    x='number_of_reviews',
    y='price',
    scatter_kws={'alpha': 0.4, 's': 50},
    line_kws={'color': 'red'},
    ax=axs[2]
)
axs[2].set_title('Regresión lineal: Precio vs. Número de Reseñas (sin outliers)', fontsize=title_font)
axs[2].set_xlabel('Número de Reseñas', fontsize=label_font)
axs[2].set_ylabel('Precio ($)', fontsize=label_font)
axs[2].tick_params(labelsize=tick_font)

plt.tight_layout()
plt.show()
In [141]:
price_median_by_neighbourhood = price_neighbourhood.groupby('neighbourhood')['price'].median().reset_index()
price_median_by_neighbourhood.columns = ['neighbourhood', 'median_price']


price_median_by_neighbourhood['price_category'] = pd.qcut(
    price_median_by_neighbourhood['median_price'], 
    q=3, 
    labels=['Bajo', 'Mediano', 'Alto']
)


reviews_by_neighbourhood = price_neighbourhood.groupby('neighbourhood')['number_of_reviews'].agg(['mean', 'median', 'count']).reset_index()


reviews_by_neighbourhood = reviews_by_neighbourhood.merge(price_median_by_neighbourhood, on='neighbourhood')


top_15_reviews = reviews_by_neighbourhood.sort_values('mean', ascending=False).head(15)


chart = alt.Chart(top_15_reviews).mark_bar(
    stroke='black',
    strokeWidth=1
).encode(
    x=alt.X('mean:Q', 
            title='Promedio de Reseñas',
            scale=alt.Scale(nice=True)),
    y=alt.Y('neighbourhood:N', 
            title='Barrio',
            sort=alt.SortField(field='mean', order='descending')),
    color=alt.Color('price_category:N',
                    scale=alt.Scale(
                        domain=['Bajo', 'Mediano', 'Alto'],
                        range=['#2ca02c', '#ffbf00', '#d62728']
                    ),
                    legend=alt.Legend(title="Categoría de Precio")),
    tooltip=[
        alt.Tooltip('neighbourhood:N', title='Barrio'),
        alt.Tooltip('mean:Q', title='Promedio de Reseñas', format='.2f'),
        alt.Tooltip('median:Q', title='Mediana de Reseñas', format='.2f'),
        alt.Tooltip('count:Q', title='Cantidad de Listados'),
        alt.Tooltip('median_price:Q', title='Mediana de Precio', format='.2f'),
        alt.Tooltip('price_category:N', title='Categoría de Precio')
    ]
).properties(
    width=600,
    height=400,
    title=alt.TitleParams(
        text='Top 15 Barrios con Mayor Promedio de Reseñas por Alojamiento',
        fontSize=16,
        fontWeight='bold',
        anchor='start'
    )
).configure_axis(
    labelFontSize=11,
    titleFontSize=12
).configure_title(
    fontSize=14,
    fontWeight='bold'
)

chart
Out[141]:
In [142]:
reviews_by_neighbourhood = price_neighbourhood.groupby('neighbourhood')['number_of_reviews'].agg(['mean', 'median', 'count']).reset_index()
reviews_by_neighbourhood = reviews_by_neighbourhood.sort_values('median', ascending=False)

plt.figure(figsize=(12, 8))
sns.barplot(data=reviews_by_neighbourhood.head(15), y='neighbourhood', x='median')
plt.title('Top 15 Barrios con Mayor Mediana de Reseñas por Alojamiento')
plt.xlabel('Mediana de Reseñas')
plt.ylabel('Barrio')
plt.tight_layout()
plt.show()
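The price categories that colour the Altair chart above come from `pd.qcut`, which splits by rank into groups of roughly equal size rather than into price ranges of equal width. A minimal sketch with hypothetical neighbourhood names and median prices (not taken from the dataset) shows the effect:

```python
import pandas as pd

# Hypothetical median prices for nine neighbourhoods, one of them far above the rest
median_price = pd.Series(
    [24000, 30000, 31000, 32000, 36000, 40000, 42000, 43000, 86000],
    index=['N1', 'N2', 'N3', 'N4', 'N5', 'N6', 'N7', 'N8', 'N9'],
    name='median_price',
)

# q=3 cuts at the 1/3 and 2/3 quantiles, so each label receives about a third of the rows
price_category = pd.qcut(median_price, q=3, labels=['Bajo', 'Mediano', 'Alto'])
counts = price_category.value_counts().sort_index()

print(price_category)
print(counts)
```

Because the split is by rank, a neighbourhood at 42000 and one at 86000 both land in 'Alto'. The colour therefore encodes relative position among neighbourhoods, not an absolute price level.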

3. Which neighbourhoods have grown the most in the last year?

In [158]:
# Compute the median price per neighbourhood
price_median_by_neighbourhood = price_neighbourhood.groupby('neighbourhood')['price'].median().reset_index()
price_median_by_neighbourhood.columns = ['neighbourhood', 'median_price']

# Create 5 price categories
price_median_by_neighbourhood['price_category'] = pd.qcut(
    price_median_by_neighbourhood['median_price'],
    q=5,
    labels=['Muy Bajo', 'Bajo', 'Mediano', 'Alto', 'Muy Alto']
)

# Compute the renewal rate (reviews in the last twelve months / total reviews)
review_stats = price_neighbourhood.groupby('neighbourhood')[['number_of_reviews', 'number_of_reviews_ltm']].sum().reset_index()
review_stats = review_stats[review_stats['number_of_reviews'] > 0].copy()
review_stats['renewal_rate'] = review_stats['number_of_reviews_ltm'] / review_stats['number_of_reviews']

# Merge with the price categories
review_stats = review_stats.merge(price_median_by_neighbourhood, on='neighbourhood')

# Top 15 neighbourhoods by renewal rate
top_renewing = review_stats.sort_values('renewal_rate', ascending=False).head(15)

# Build the chart with 5 categories
chart = alt.Chart(top_renewing).mark_bar(
    stroke='black',
    strokeWidth=1
).encode(
    x=alt.X('renewal_rate:Q', title='Tasa de Renovación (último año / total)', scale=alt.Scale(domain=[0, 1])),
    y=alt.Y('neighbourhood:N', title='Barrio', sort=alt.SortField(field='renewal_rate', order='descending')),
    color=alt.Color('price_category:N',
                    scale=alt.Scale(
                        domain=['Muy Bajo', 'Bajo', 'Mediano', 'Alto', 'Muy Alto'],
                        range=['#1a9850', '#91cf60', '#fee08b', '#fc8d59', '#d73027']
                    ),
                    legend=alt.Legend(title='Categoría de Precio')),
    tooltip=[
        alt.Tooltip('neighbourhood:N', title='Barrio'),
        alt.Tooltip('renewal_rate:Q', title='Tasa de Renovación', format='.2f'),
        alt.Tooltip('number_of_reviews_ltm:Q', title='Reviews Último Año'),
        alt.Tooltip('number_of_reviews:Q', title='Reviews Totales'),
        alt.Tooltip('median_price:Q', title='Mediana Precio', format='.0f'),
        alt.Tooltip('price_category:N', title='Categoría Precio')
    ]
).properties(
    width=650,
    height=400,
    title='Top 15 Barrios con Mayor Tasa de Reseñas en el último año'
).configure_axis(
    labelFontSize=11,
    titleFontSize=12
).configure_title(
    fontSize=14,
    fontWeight='bold'
)

chart
Out[158]:
In [159]:
# Compute the median price per neighbourhood
price_median_by_neighbourhood = price_neighbourhood.groupby('neighbourhood')['price'].median().reset_index()
price_median_by_neighbourhood.columns = ['neighbourhood', 'median_price']

# Create 5 price categories
price_median_by_neighbourhood['price_category'] = pd.qcut(
    price_median_by_neighbourhood['median_price'],
    q=5,
    labels=['Muy Bajo', 'Bajo', 'Mediano', 'Alto', 'Muy Alto']
)

# Compute review statistics
review_stats = price_neighbourhood.groupby('neighbourhood')[['number_of_reviews', 'number_of_reviews_ltm']].sum().reset_index()
review_stats = review_stats[review_stats['number_of_reviews'] > 0].copy()
review_stats['renewal_rate'] = review_stats['number_of_reviews_ltm'] / review_stats['number_of_reviews']

# Merge with the price data
review_stats = review_stats.merge(price_median_by_neighbourhood, on='neighbourhood')

# Keep neighbourhoods with at least 10 reviews
filtered_stats = review_stats[review_stats['number_of_reviews'] >= 10].copy()

# Build the scatter plot
scatter_chart = alt.Chart(filtered_stats).mark_circle(
    size=100,
    stroke='white',
    strokeWidth=1
).encode(
    x=alt.X('median_price:Q', 
            title='Precio Mediano por Barrio ($)',
            scale=alt.Scale(nice=True)),
    y=alt.Y('renewal_rate:Q', 
            title='Tasa de Renovación de Reseñas',
            scale=alt.Scale(domain=[0, 1])),
    color=alt.Color('price_category:N',
                    scale=alt.Scale(
                        domain=['Muy Bajo', 'Bajo', 'Mediano', 'Alto', 'Muy Alto'],
                        range=['#1a9850', '#91cf60', '#fee08b', '#fc8d59', '#d73027']
                    ),
                    legend=alt.Legend(title='Categoría de Precio')),
    size=alt.Size('number_of_reviews:Q',
                  scale=alt.Scale(range=[50, 400]),
                  legend=alt.Legend(title='Total de Reseñas')),
    tooltip=[
        alt.Tooltip('neighbourhood:N', title='🏘️ Barrio'),
        alt.Tooltip('median_price:Q', title='💰 Precio Mediano', format='$.0f'),
        alt.Tooltip('renewal_rate:Q', title='📊 Tasa de Renovación', format='.2f'),
        alt.Tooltip('number_of_reviews_ltm:Q', title='📈 Reviews Último Año'),
        alt.Tooltip('number_of_reviews:Q', title='📝 Reviews Totales'),
        alt.Tooltip('price_category:N', title='🏷️ Categoría Precio')
    ]
).properties(
    width=700,
    height=500,
    title=alt.TitleParams(
        text='Relación entre Precio y Tasa de Reseñas por Barrio',
        subtitle='El tamaño del punto representa el total de reseñas',
        fontSize=16,
        fontWeight='bold',
        anchor='start'
    )
).configure_axis(
    labelFontSize=11,
    titleFontSize=12,
    grid=True,
    gridOpacity=0.3
).configure_title(
    fontSize=14,
    fontWeight='bold'
)

# Display the chart
scatter_chart
Out[159]:
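The renewal rate used in this section is a ratio of sums: all reviews from the last twelve months in a neighbourhood divided by all of its reviews. It serves as a proxy for recent activity, not a direct measure of growth. The sketch below, on hypothetical toy listings, contrasts it with one possible alternative, the mean of per-listing ratios. The column names match the notebook; the data and the `listing_rate` column are assumptions.

```python
import pandas as pd

# Hypothetical listings: neighbourhood A mixes an old, heavily reviewed listing with a new one
listings = pd.DataFrame({
    'neighbourhood': ['A', 'A', 'A', 'B', 'B'],
    'number_of_reviews': [100, 10, 0, 40, 40],
    'number_of_reviews_ltm': [10, 10, 0, 20, 20],
})

# Ratio of sums, as in the cells above: listings with many reviews dominate
totals = listings.groupby('neighbourhood')[['number_of_reviews', 'number_of_reviews_ltm']].sum()
totals = totals[totals['number_of_reviews'] > 0]
totals['renewal_rate'] = totals['number_of_reviews_ltm'] / totals['number_of_reviews']

# Alternative: mean of per-listing ratios, where every reviewed listing weighs the same
reviewed = listings[listings['number_of_reviews'] > 0].copy()
reviewed['listing_rate'] = reviewed['number_of_reviews_ltm'] / reviewed['number_of_reviews']
mean_listing_rate = reviewed.groupby('neighbourhood')['listing_rate'].mean()

print(totals['renewal_rate'])
print(mean_listing_rate)
```

The two measures agree for neighbourhood B but diverge for A, so the ranking of neighbourhoods can depend on which definition is chosen.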

4. What is the relationship between availability and price?

In [147]:
avail_price = df_clean.dropna(subset=['availability_365', 'price'])
avail_price = avail_price[avail_price['price'] > 0]


correlation = avail_price['availability_365'].corr(avail_price['price'])
print(f"\nCorrelación entre disponibilidad y precio: {correlation:.3f}")


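# pd.cut intervals are right-closed by default, so availability_365 == 0 falls outside
# the first bin (0, 90] and gets NaN here; pass include_lowest=True to keep those listings.
avail_price['availability_category'] = pd.cut(
    avail_price['availability_365'],
    bins=[0, 90, 180, 270, 365],
    labels=['Baja (0-90)', 'Media (91-180)', 'Alta (181-270)', 'Muy Alta (271-365)']
)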

avail_stats = avail_price.groupby('availability_category')['price'].agg([
    'mean', 'median', 'count'
]).round(2)

print("\n Precio promedio por nivel de disponibilidad:")
print(avail_stats)

print("\n INSIGHTS ADICIONALES:")
always_available = avail_price[avail_price['availability_365'] == 365]
never_available = avail_price[avail_price['availability_365'] == 0]

print(f"Listados disponibles todo el año: {len(always_available)} ({len(always_available)/len(avail_price)*100:.1f}%)")
print(f"Listados no disponibles: {len(never_available)} ({len(never_available)/len(avail_price)*100:.1f}%)")


print(f"Precio promedio (disponibles todo el año): ${always_available['price'].mean():.2f}")
Correlación entre disponibilidad y precio: -0.007

 Precio promedio por nivel de disponibilidad:
                            mean   median  count
availability_category                           
Baja (0-90)            134079.56  36758.0   6400
Media (91-180)          60534.03  37808.0   5829
Alta (181-270)         114939.27  40360.0   6345
Muy Alta (271-365)      83117.61  42009.0  12801

 INSIGHTS ADICIONALES:
Listados disponibles todo el año: 1667 (5.3%)
Listados no disponibles: 223 (0.7%)
Precio promedio (disponibles todo el año): $136687.73
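The four category counts printed above add up to 31,375, which is 223 fewer than the 31,598 priced listings reported earlier and matches the 223 listings with zero availability. That is consistent with how `pd.cut` treats the lowest bin edge: intervals are right-closed, so a value equal to the first edge is left uncategorised unless `include_lowest=True` is passed. A minimal sketch on a handful of illustrative values (not the dataset):

```python
import pandas as pd

availability = pd.Series([0, 1, 90, 91, 365])
bins = [0, 90, 180, 270, 365]
labels = ['Baja (0-90)', 'Media (91-180)', 'Alta (181-270)', 'Muy Alta (271-365)']

# Default behaviour: the first interval is (0, 90], so 0 is not assigned to any bin
default_cut = pd.cut(availability, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left, so 0 is kept
inclusive_cut = pd.cut(availability, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({
    'availability': availability,
    'default': default_cut,
    'include_lowest': inclusive_cut,
}))
```

Using `-1` as the first edge, as the review categories do earlier in the notebook, is an equivalent fix.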
In [148]:
plt.figure(figsize=(15, 10))


plt.subplot(2, 2, 1)
plt.scatter(avail_price['availability_365'].values, avail_price['price'].values, alpha=0.5)
plt.xlabel('Disponibilidad (días al año)')
plt.ylabel('Precio ($)')
plt.title(f'Disponibilidad vs Precio (r={correlation:.3f})')


plt.subplot(2, 2, 2)
sns.boxplot(data=avail_price, x='availability_category', y='price')
plt.xticks(rotation=45)
plt.title('Precio por Categoría de Disponibilidad')


plt.subplot(2, 2, 3)
plt.hist(avail_price['availability_365'].values, bins=30, alpha=0.7, color='skyblue', edgecolor='black')
plt.xlabel('Días Disponibles al Año')
plt.ylabel('Frecuencia')
plt.title('Distribución de Disponibilidad')


plt.subplot(2, 2, 4)
avg_prices = avail_stats['mean']

sns.barplot(x=avg_prices.index.tolist(), y=avg_prices.values, palette='coolwarm')
plt.xticks(rotation=45)
plt.title('Precio Promedio por Nivel de Disponibilidad')
plt.ylabel('Precio Promedio ($)')

plt.tight_layout()
plt.show()
In [161]:
avail_stats_reset = avail_stats.reset_index()
avail_stats_reset['percentage'] = (avail_stats_reset['count'] / avail_stats_reset['count'].sum() * 100).round(1)
avail_stats_reset['houses_needed'] = avail_stats_reset['percentage'].round().astype(int)

isotype_data = []
house_id = 0

for _, row in avail_stats_reset.iterrows():
    category = row['availability_category']
    houses = int(row['houses_needed'])
    count = row['count']
    percentage = row['percentage']
    price_avg = row['mean']
    
    for i in range(houses):
        col = house_id % 10
        row_pos = house_id // 10
        isotype_data.append({
            'x': col,
            'y': row_pos,
            'category': category,
            'count': count,
            'percentage': percentage,
            'price_avg': price_avg,
            'house_id': house_id,
        })
        house_id += 1

isotype_df = pd.DataFrame(isotype_data)

emoji_mapping = {
    'Baja (0-90)': '🔥',
    'Media (91-180)': '🏡', 
    'Alta (181-270)': '🏢',
    'Muy Alta (271-365)': '🏘️'
}

# Build legend labels that include the emoji
isotype_df['emoji'] = isotype_df['category'].map(emoji_mapping)
isotype_df['category_with_emoji'] = isotype_df['emoji'] + ' ' + isotype_df['category']

house_chart = alt.Chart(isotype_df).mark_text(
    fontSize=25,
    baseline='middle',
    align='center'
).encode(
    x=alt.X('x:O', axis=None),
    y=alt.Y('y:O', axis=None, sort='descending'),
    text='emoji:N',
    color=alt.Color('category_with_emoji:N',
                    scale=alt.Scale(
                        domain=['🔥 Baja (0-90)', '🏡 Media (91-180)', '🏢 Alta (181-270)', '🏘️ Muy Alta (271-365)'],
                        range=['#e74c3c', '#f39c12', '#f1c40f', '#27ae60']
                    ),
                    legend=alt.Legend(
                        title="🏠 Nivel de Disponibilidad",
                        orient="bottom",
                        columns=2,
                        titleFontSize=14,
                        labelFontSize=12,
                        symbolSize=100
                    )),
    tooltip=[
        alt.Tooltip('category:N', title='🏠 Disponibilidad'),
        alt.Tooltip('count:Q', title='📊 Propiedades', format=','),
        alt.Tooltip('percentage:Q', title='📈 Porcentaje', format='.1f'),
        alt.Tooltip('price_avg:Q', title='💰 Precio Promedio', format='$.0f')
    ]
).properties(
    width=500,
    height=300,
    title=alt.TitleParams(
        text=[
            '🏘️ Distribución de Disponibilidad - Airbnb Buenos Aires',
            f'Cada casa = ~1% del total | Total: {avail_stats_reset["count"].sum():,} propiedades'
        ],
        fontSize=16,
        fontWeight='bold',
        anchor='middle',
        subtitleFontSize=12,
        subtitleColor='#666666'
    )
).configure_view(strokeWidth=0)

house_chart
Out[161]:
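The isotype grid above derives the number of icons per category by rounding each percentage independently, which is not guaranteed to add up to exactly 100 icons. One common remedy is largest-remainder allocation, sketched below. The function name `largest_remainder_units` is an assumption; the first set of counts is taken from the availability table printed earlier.

```python
import numpy as np
import pandas as pd


def largest_remainder_units(counts, total_units=100):
    """Allocate `total_units` icons proportionally to `counts` so they sum exactly."""
    shares = counts / counts.sum() * total_units
    units = np.floor(shares).astype(int)
    missing = total_units - int(units.sum())
    # Hand the leftover icons to the categories with the largest fractional parts
    remainders = (shares - units).sort_values(ascending=False, kind='stable')
    units.loc[remainders.index[:missing]] += 1
    return units


# Counts from the availability table above
counts = pd.Series(
    [6400, 5829, 6345, 12801],
    index=['Baja (0-90)', 'Media (91-180)', 'Alta (181-270)', 'Muy Alta (271-365)'],
)
units = largest_remainder_units(counts)
print(units, units.sum())

# A case where independent rounding gives 99 icons: three equal categories
thirds = largest_remainder_units(pd.Series([1, 1, 1], index=['a', 'b', 'c']))
print(thirds.sum())
```

For the counts in this notebook the result coincides with plain rounding; the difference only appears for splits such as three equal categories, where rounding each 33.3% separately leaves one icon missing.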